
Improve performance of extract operations #172

Closed
simonw opened this issue Sep 23, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

simonw commented Sep 23, 2020

This command took about 12 minutes (against a 150MB file with 680,000 rows in it):

sqlite-utils extract salaries.db salaries \
   'Organization Group Code' 'Organization Group' \
  --table 'organization_groups' \
  --fk-column 'organization_group_id' \
  --rename 'Organization Group Code' code \
  --rename 'Organization Group' name

I'm pretty confident we can do better than that.

@simonw simonw added the enhancement New feature or request label Sep 23, 2020

simonw commented Sep 23, 2020

Steps to produce that database:

curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'
sqlite-utils insert salaries.db salaries salaries.csv --csv


simonw commented Sep 23, 2020

Here's the loop that's taking the time:

for row in rows_iter:
    row_pks = tuple(row[pk] for pk in pks)
    lookups = {rename.get(column) or column: row[column] for column in columns}
    self.update(row_pks, {first_column: lookup_table.lookup(lookups)})
    if progress:
        progress(1)


simonw commented Sep 23, 2020

lookup_table.lookup(lookups) is doing a SQL lookup. This could be cached in-memory, maybe with an LRU cache, to avoid looking up the primary key for value combinations we have recently used.
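A rough sketch of what that in-memory cache could look like (`CachedLookup` is a hypothetical name, not sqlite-utils API; a real version would wrap `lookup_table.lookup`):

```python
# Hypothetical sketch: memoize lookup-table primary keys in memory so
# repeated value combinations skip the SQL round trip. Names here are
# illustrative only; a real version would wrap lookup_table.lookup().
class CachedLookup:
    def __init__(self, lookup_fn):
        self.lookup_fn = lookup_fn  # the function that does the SQL lookup
        self.cache = {}

    def lookup(self, lookups):
        # dicts aren't hashable, so key on a sorted tuple of the items
        key = tuple(sorted(lookups.items()))
        if key not in self.cache:
            self.cache[key] = self.lookup_fn(lookups)
        return self.cache[key]
```

For the salaries table, where a handful of distinct organization groups repeat across 680,000 rows, the hit rate would be very high; a bounded cache (e.g. functools.lru_cache) would cap memory if the distinct values were unbounded.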

The .update() method it calls first does a get() and then runs a SQL UPDATE ... WHERE:

def update(self, pk_values, updates=None, alter=False, conversions=None):
    updates = updates or {}
    conversions = conversions or {}
    if not isinstance(pk_values, (list, tuple)):
        pk_values = [pk_values]
    # Soundness check that the record exists (raises error if not):
    self.get(pk_values)
    if not updates:
        return self
    args = []
    sets = []
    wheres = []
    validate_column_names(updates.keys())
    for key, value in updates.items():
        sets.append("[{}] = {}".format(key, conversions.get(key, "?")))
        args.append(value)
    wheres = ["[{}] = ?".format(pk_name) for pk_name in self.pks]
    args.extend(pk_values)
    sql = "update [{table}] set {sets} where {wheres}".format(
        table=self.name, sets=", ".join(sets), wheres=" and ".join(wheres)
    )

Batching those updates may have an effect. Or finding a way to skip the .get(), since we already know the record is valid.
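A hedged sketch of what batching could look like using sqlite3 directly (`batched_update` is a hypothetical helper, not sqlite-utils API): collect (new value, pk) pairs and apply them with a single executemany() in one transaction, skipping the per-row .get():

```python
import sqlite3

def batched_update(conn, table, column, pk, pairs):
    # One prepared UPDATE, executed for every (new_value, pk) pair in a
    # single transaction instead of one statement + commit per row.
    sql = "update [{}] set [{}] = ? where [{}] = ?".format(table, column, pk)
    with conn:
        conn.executemany(sql, pairs)

# Toy demonstration against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("create table salaries (id integer primary key, group_id integer)")
conn.executemany("insert into salaries (id) values (?)", [(1,), (2,), (3,)])
batched_update(conn, "salaries", "group_id", "id", [(10, 1), (20, 2), (30, 3)])
```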


simonw commented Sep 23, 2020

I ran sudo py-spy top -p 123 against the process while it was running and the most time is definitely spent in .update():

Total Samples 1000
GIL: 0.00%, Active: 90.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                                  
 38.00%  38.00%    3.85s     3.85s   update (sqlite_utils/db.py:1283)
 27.00%  27.00%    2.12s     2.12s   execute (sqlite_utils/db.py:161)
 10.00%  10.00%   0.890s    0.890s   execute (sqlite_utils/db.py:163)
 10.00%  17.00%   0.870s     1.54s   columns (sqlite_utils/db.py:553)
  0.00%   0.00%   0.110s    0.210s   <listcomp> (sqlite_utils/db.py:554)
  0.00%   3.00%   0.100s    0.320s   table_names (sqlite_utils/db.py:191)
  0.00%   0.00%   0.100s    0.100s   __new__ (<string>:1)


simonw commented Sep 23, 2020

I wonder if I could make this faster by separating it out into a few steps:

  • Create the new lookup table with all of the distinct rows
  • Add the blank foreign key column
  • Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
  • Drop the value columns
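
Those four steps could be sketched as plain SQL against a toy version of the salaries table (table and column names below are illustrative, not sqlite-utils output):

```python
import sqlite3

# Illustrative sketch of the multi-step approach, not sqlite-utils code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table salaries (id integer primary key, [Organization Group] text, pay real);
insert into salaries values
    (1, 'Public Works', 50000), (2, 'Public Works', 60000), (3, 'Health', 55000);

-- Step 1: lookup table of the distinct values
create table organization_groups (id integer primary key, name text);
insert into organization_groups (name)
    select distinct [Organization Group] from salaries;

-- Step 2: blank foreign key column
alter table salaries add column organization_group_id integer
    references organization_groups(id);

-- Step 3: one UPDATE with a correlated subquery, instead of one per row
update salaries set organization_group_id = (
    select id from organization_groups
    where name = salaries.[Organization Group]
);
""")
# Step 4 (dropping the value column) needs a table rebuild in SQLite
# versions before 3.35, which added ALTER TABLE ... DROP COLUMN.
```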


simonw commented Sep 23, 2020

The problem with this approach is that it's not compatible with progress bars - but if it's several times faster it's worth it.


simonw commented Sep 23, 2020

Also, what would happen if new rows were added to the table while that command was running?


simonw commented Sep 23, 2020

There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.

simonw added a commit that referenced this issue Sep 23, 2020
Refs #172 - seems to give me about 20% speedup.

simonw commented Sep 24, 2020

I wonder if I could make this faster by separating it out into a few steps:

  • Create the new lookup table with all of the distinct rows
  • Add the blank foreign key column
  • Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
  • Drop the value columns

My prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it!

% date
sqlite-utils extract salaries.db salaries \
   'Department Code' 'Department' \
  --table 'departments' \
  --fk-column 'department_id' \
  --rename 'Department Code' code \
  --rename 'Department' name
date
sqlite-utils extract salaries.db salaries \
   'Union Code' 'Union' \
  --table 'unions' \
  --fk-column 'union_id' \
  --rename 'Union Code' code \
  --rename 'Union' name
date
sqlite-utils extract salaries.db salaries \
   'Job Family Code' 'Job Family' \
  --table 'job_families' \
  --fk-column 'job_family_id' \
  --rename 'Job Family Code' code \
  --rename 'Job Family' name
date
sqlite-utils extract salaries.db salaries \
   'Job Code' 'Job' \
  --table 'jobs' \
  --fk-column 'job_id' \
  --rename 'Job Code' code \
  --rename 'Job' name
date
Thu Sep 24 00:48:16 PDT 2020

Thu Sep 24 00:48:20 PDT 2020

Thu Sep 24 00:48:24 PDT 2020

Thu Sep 24 00:48:28 PDT 2020

Thu Sep 24 00:48:32 PDT 2020

simonw added a commit that referenced this issue Sep 24, 2020
Takes my test down from ten minutes to four seconds.
@simonw simonw closed this as completed in 022cdd9 Sep 24, 2020
simonw added a commit that referenced this issue Sep 24, 2020